# Network-on-chip (NOC) 

## Topologies

## Network Topology

$\square$ Static arrangement of channels and nodes in an interconnection network

- The roads over which packets travel

■ Topology chosen based on cost and performance
$\Rightarrow$ Cost and performance determided by many factors (flow control, routing, traffic)
$\Rightarrow$ Measures to evaluate just the topology
$\checkmark$ Bisection bandwidth
$\checkmark$ Channel load
$\checkmark$ Path delay

## Factors Affecting Perfomance

$\square$ Factors that influence the performance of a NoC are
$\rightarrow$ Topology (static arrangement of channels and nodes)
$\rightarrow$ Routing Technique (selection of a path through the network)
$\rightarrow$ Flow Control (how are network resources allocated, if packets traverse the network)
$\rightarrow$ Router Architecture (buffers, switches, ...)
$\rightarrow$ Traffic Pattern

## Direct and Indirect Networks

■ Direct Network
$\Rightarrow$ Every Node in the network is both a terminal and a switch


■ Indirect Network
$\rightarrow$ Nodes are either switches or terminal


Indirect Network

## Direct Networks

■ aka point-to-point network

- Consists of a set of nodes, each one being directly connected to a (usually small) subset of other nodes in the network
$\rightarrow$ These nodes may have different functional capabilities

$\checkmark$ E.g., vector processors, graphics processors, I/O processors, etc.


## Direct Networks - Router

- A common component of the node is the router
$\Rightarrow$ It handles message communication among nodes
$\checkmark$ For this reason, direct networks are also known as router-based networks
$\rightarrow$ Each router has direct
 connections to the router of its neighbors


## Direct Networks - Links

- Two neighboring nodes are connected by a pair of unidirectional channels in opposite directions
- A bidirectional channel may also be used to connect two neighboring nodes


## Direct Networks - Scalability

$\square$ As the number of nodes in the system increases, the total communication bandwidth also increase
$\rightarrow$ Thus, direct networks have been a popular interconnection architecture for constructing large-scale parallel computers

## Direct Networks - Topologies

■ Many network topologies have been proposed in terms of their graph-theoretical properties
$\rightarrow$ Very few of them have ever been implemented
$\rightarrow$ Most of the implemented networks have an orthogonal topology

## DN - Orthogonal Topology

■ A network topology is orthogonal if and only if nodes can be arranged in an orthogonal $n$ dimensional space, and every link can be arranged in such a way that it produces a displacement in a single dimension
■ Orthogonal Topologies
$\rightarrow$ Strictly orthogonal topology
$\checkmark$ Every node has at least one link crossing each dimension
$\rightarrow$ Weakly orthogonal topology
$\checkmark$ Some nodes may not have any link in some dimensions

## DN - Strictly Orthogonal Topologies

$\square$ Routing is very simple
$\rightarrow$ Can be efficiently implemented in hardware
$\square$ Most popular strictly orthogonal direct network topologies
$\Rightarrow n$-dimensional mesh
$\Rightarrow k$-ary $n$-cube (torus)
$\Rightarrow$ Hypercube

## $n$-Dimensional Mesh

$\square$ It has $K_{0} \times K_{1} x \ldots \times K_{n-1}$ nodes, $K_{\mathrm{i}}$ nodes along each dimension $i$
$\square$ Two nodes $X$ and $Y$ are neighbors if and only if $y_{\mathrm{i}}=x_{\mathrm{i}}$ for all $i, 0 \leq i \leq n-1$, except one,
$j$, where $y_{j}=x_{j} \pm 1$
Thus, nodes have from $n$ to $2 n$ neighbors, depending on their location in the mesh
$\checkmark$ Therefore, this topology is not regular

## $n$-Dimensional Mesh



3-dimensional mesh

## k-ary n-cube

$\square$ All nodes have the same number of neighbors
It has $K^{n}$ nodes
$\square$ Two nodes $X$ and $Y$ are neighbors if and only if $y_{\mathrm{i}}=x_{\mathrm{i}}$ for all $\mathrm{i}, 0 \leq i \leq \mathrm{n}-1$, except one, $j$, where $y_{j}=\left(x_{j} \pm 1\right) \bmod K$
$\rightarrow$ Modular arithmetic adds wraparound channels $\checkmark$ Therefore, this topology is regular

## k-ary n-cube



It is a special case of both $n$-dimensional meshes and $k$-ary $n$-cubes
$\square$ A hypercube is an n-dimensional mesh in which $K_{\mathrm{i}}=2$ for $0 \leq i \leq \mathrm{n}-1$, or a 2-ary $n$ cube
$\rightarrow$ This topology is regular


## Other Direct Network Topologies

- Aimed at minimizing the network diameter

■ Every node but the root has a single parent node
$\Rightarrow$ Trees contain no cycles
■ $k$-ary tree
$\rightarrow$ A tree in which every node but the leaves has a fixed number $k$ of descendants

- Balanced tree
$\Rightarrow$ The distance from every leaf node to the root is the same



## Drawbacks of Trees

$\square$ Root node and the nodes close to it become a bottleneck
$\rightarrow$ Allocating a higher channel bandwidth to channels located close to the root node
$\checkmark$ Using channels with different bandwidths is not practical, especially when message transmission is pipelined
■ There are no alternative paths between any pair of nodes

## Indirect Networks

- The communication between any two nodes is carried through some switches
$\square$ Each node has a network adapter that connects to a network switch
-The interconnection of those switches defines various network topologies


Indirect Network

## Topology \& Physical Constraints

- It is important to model the relationships between physical constraints and topology
$\rightarrow$ And the resulting impact on performance
$\square$ Network optimization is the process of utilizing these models
$\rightarrow$ For selecting topologies that best match the physical constraints of the implementation
$\square$ For a given implementation technology, physical constraints determine architectural features
- Channel widths
$\checkmark$ Impact on zero-load latency


## Bisection Width/Bandwidth

■ One of the physical constraints facing the implementation of interconnection networks is the available wiring area

- The available wiring area is determined by the packaging technology
$\Rightarrow$ Whether the network resides on a chip, multichip module, or printed circuit board
■ VLSI systems are generally wire limited
$\Rightarrow$ The silicon area required by these systems is determined by the interconnect area, and the performance is limited by the delay of these interconnections
- The choice of network dimension is influenced by how well the resulting topology makes use of the available wiring area
$\Rightarrow$ One such performance measure is the bisection width


## Cuts

- A cut of a network, $C\left(N_{1}, N_{2}\right)$, is a set of channels that partitions the set of all nodes into two disjoint sets, $N_{1}$ and $N_{2}$
$\Rightarrow$ Each element in $C\left(N_{1}, N_{2}\right)$ is a channel with a source in $N_{1}$ and destination in $N_{2}$ or vice versa



## Bandwidth of the Cut

## - Total bandwidth of the cut $C\left(N_{1}, N_{2}\right)$

$$
B\left(N_{1}, N_{2}\right)=\sum_{c \in C\left(N_{1}, N_{2}\right)} b_{c}
$$

## Bisection

$\square$ The bisection is a cut that partitions the entire network nearly in half
$\square$ The channel bisection of a network, $B_{C}$, is the minimum channel count over all bisections

$$
B_{C}=\min _{\text {bisections }}\left|C\left(N_{1}, N_{2}\right)\right|
$$

- The bisection bandwidth of a network, $B_{\mathrm{B}}$, is the minimum bandwidth over all bisections

$$
B_{B}=\min _{\text {bisections }}\left|B\left(N_{1}, N_{2}\right)\right|
$$

## Bisection Examples



## Diameter

- The diameter of a network, $H_{\text {max }}$, is the largest, minimal hop count over all pairs of terminal nodes

$$
H_{\max }=\max _{\mathrm{x}, \mathrm{y} \in N}|H(x, y)|
$$

For a fully connected network with $N$ terminals built from switches with out degree $\delta_{0}, H_{\text {max }}$ is bounded by

$$
\begin{equation*}
H_{\max } \geq \log _{\delta_{o}} N \tag{1}
\end{equation*}
$$

Each terminal can reach at most $\delta_{0}$ other terminals after one hop
At most $\delta_{0}{ }^{2}$ after two hops, and at most $\delta_{0}{ }^{H}$ after $H$ hops If we set $\delta_{0}{ }^{H}=N$ and solve for $H$, we get (1)

## Average Minimum Hop count

$\square$ The average minimum hop count of a network, $H_{\text {min }}$, is defined as the average hop count over all sources and destinations

$$
H_{\min }=\frac{1}{N^{2}} \sum_{x, y \in N} H(x, y)
$$

## Physical Distance and Delay

■The physical distance of a path is

$$
D(P)=\sum_{c \in P} l_{c}
$$

- The delay of a path is

$$
t(P)=D(P) / v
$$

## Performance

- Throughput
$\Rightarrow$ Data rate in bits/s that the network accepts per input port
$\rightarrow$ It is a property of the entire network
$\rightarrow$ It depends on
$\checkmark$ Routing
$\checkmark$ Flow control
$\checkmark$ Topology


## deal Throughput

■ /deal throughput of a topology
$\Rightarrow$ Throughput that the network could carry with perfect flow control (no contention) and routing (load balanced over alternative paths)
■ Maximum throughput
$\Rightarrow$ It occurs when some channel in the network becomes saturated
$\square$ We suppose for semplicity that all the channel bandwidths are b

## Channel Load

$\square$ We define the load of a channel $c, \gamma_{c}$, as

## $\gamma_{c}=\frac{\text { bandwidth demanded from channel } c}{\text { bandwidth of the input ports }}$

■ Equivalently
$\Rightarrow$ Amount of traffic that must cross $c$ if each input injects one unit of traffic
■ Of course, it depends on the traffic pattern considered
$\Rightarrow$ We will assume uniform traffic

## Maximum Channel Load

■ Under a particular traffic pattern, the channel that carries the largest fraction of traffic ( the bottleneck channel) determines the maximum channel load $\gamma_{\text {max }}$ of the topology

$$
\gamma_{\max }=\max _{c \in C} \gamma_{c}
$$

## Ideal Throughput

■ When the offered traffic reaches the throughput of the network, the load on the bottleneck channel will be equal to the channel bandwidth b
$\Rightarrow$ Any additional traffic would overload this channel

- The ideal throughput $\Theta_{\text {ideal }}$ is the input bandwidth that saturates the bottleneck channel

$$
\begin{aligned}
& \gamma_{c}=\frac{\text { bandwidth demanded from channel } c}{\text { bandwidth of the input ports }} \\
& \gamma_{c}=\gamma_{\max }=\frac{b}{\Theta_{\text {ideal }}} \\
& \Theta_{\text {ideal }}=\frac{b}{\gamma_{\max }}
\end{aligned}
$$

## Bounds for $\gamma_{\max }$

$\square \gamma_{\max }$ is very hard to compute for the general case (arbitrary topology and arbitrary traffic pattern)
■ For uniform traffic some upper and lower bounds can be computed with much less effort

## Lower Bound on $\gamma$

- The load on the bisection channels gives a lower bound on $\gamma_{\text {max }}$
- Let us assume uniform traffic
$\Rightarrow$ On average, half of the traffic ( $N / 2$ packets) must cross the $B_{C}$ bisection channels
$\Rightarrow$ The best throughput occurs when these packets are distributed evenly across the bisection channels
$\Rightarrow$ Thus, the load on each bisection channel $\gamma_{B}$ is at least

$$
\gamma_{\max } \geq \gamma_{B}=\frac{N}{2 \mathrm{~B}_{C}}
$$



## Upper Bound on $\Theta_{\text {ideal }}$

- We found that

$$
\Theta_{\text {ideal }}=\frac{b}{\gamma_{\max }} \quad \text { and } \quad \gamma_{\max } \geq \gamma_{B}=\frac{N}{2 \mathrm{~B}_{C}}
$$

■ Combining the above equations we have

$$
\Theta_{\text {ideal }} \leq \frac{2 b B_{C}}{N}=\frac{2 \mathrm{~B}_{B}}{N}
$$

## atency

-The latency of a network is the time required for a packet to traverse the network
$\Rightarrow$ From the time the head of the packet arrives at the input port to the time the tail of the packet departs the output port

## Components of the Latency

$\square$ We separate latency, $T$, into two components
$\rightarrow$ Head latency $\left(T_{h}\right)$ : time required for the head to traverse the network
$\Rightarrow$ Serialization latency ( $T_{\mathrm{s}}$ ): time for a packet of length $L$ to cross a channel with bandwidth $b$

$$
T=T_{h}+T_{s}=T_{h}+\frac{L}{b}
$$

## Contributions

■ Like throughput, latency depends on
$\Rightarrow$ Routing
$\Rightarrow$ Flow control
$\Rightarrow$ Design of the router
$\Rightarrow$ Topology

## Latency at Zero Load

$\square$ We consider latency at zero load, $T_{0}$
$\rightarrow$ Latency when no contention occurs
$\square T_{\mathrm{h}}$ : sum of two factors determined by the topology
$\rightarrow$ Router delay ( $T_{r}$ ): time spent in the routers
$\rightarrow$ Time of flight ( $T_{w}$ ): time spent on the wires

$$
\begin{gathered}
T_{h}=T_{r}+T_{w}=H_{\min } t_{r}+\frac{D_{\min }}{v} \\
T_{0}=H_{\min } t_{r}+\frac{D_{\min }}{v}+\frac{L}{b}
\end{gathered}
$$

## Latency at Zero Load

## Technology

Topology

Node degree

## Packet Propagation

$\xrightarrow{\text { time }}$


## Case Study

- A good topology exploits characteristics of the available packaging technology to meet bandwidth and latency requirements of the application
■ To maximize bandwidth a topology should saturate the bisection bandwidth


## Bandwidth Analysis (Torus)



Assume: 256 signals @ 1Gbits/s

Bisection bandwidth 256 Gbits/s

## Bandwidth Analysis (Torus)

$\square 16$ unidirectional channels cross the midpoint of the topology

- To saturate the bisection of 256 signals
$\Rightarrow$ Each channel crossing the bisection should be 256/16 = 16 signals wide
$\square$ Constraints
$\rightarrow$ Each node packaged on a IC
$\checkmark$ Limited number of I/O pins (e.g., 128)
$\checkmark 8$ channels per node $\rightarrow 8 \times 16=128$ pins $\rightarrow$ OK


## Bandwidth Analysis (Ring)



■ 4 unidirectional channels cross the mid-point of the topology

- To saturate the bisection of 256 signals
$\Rightarrow$ Each channel crossing the bisection should be 256/4 = 64 signals wide
- Constraints
- Each node packaged on a IC
$\checkmark$ Limited number of I/O pins (e.g., 128)
$\checkmark 4$ channels per node $\rightarrow 4 \times 64=256$ pins $\rightarrow$ INVALID
$\Rightarrow$ With identical technology constraints, the ring provides only half the bandwidth of the torus


## Delay Analysis

■ The application requires only 16Gbits/s
> ...but also minimum latency
■ The application uses long 4,096-bit packets
■ Suppose random traffic
$\Rightarrow$ Average hop count
$\checkmark$ Torus $=2$
$\checkmark$ Ring $=4$
$\square$ Channel size
$\Rightarrow$ Torus $=16$ bits
$\Rightarrow$ Ring $=32$ bits

## Delay Analysis

$\square$ Serialization latency (channel speed 1GHz)
$\Rightarrow$ Torus $=4,096 / 16$ * $1 \mathrm{~ns}=256 \mathrm{~ns}$
$\Rightarrow$ Ring $=4,096 / 32 * 1 \mathrm{~ns}=128 \mathrm{~ns}$

- Latency assuming 20ns hop delay
$\rightarrow$ Torus $=256+20 * 2=296$ ns
$\Rightarrow$ Ring $=128+20 * 4=208 \mathrm{~ns}$
$\square$ No one topology is optimal for all applications Different topologies are appropriate for different constraints and requirements

